{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "[![AWS SDK for pandas](_static/logo.png \"AWS SDK for pandas\")](https://github.com/aws/aws-sdk-pandas)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# 29 - S3 Select"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "AWS SDK for pandas supports [Amazon S3 Select](https://aws.amazon.com/blogs/aws/s3-glacier-select/), enabling applications to use SQL statements in order to query and filter the contents of a single S3 object. It works on objects stored in CSV, JSON or Apache Parquet, including compressed and large files of several TBs.\n",
    "\n",
    "With S3 Select, the query workload is delegated to Amazon S3, leading to lower latency and cost, and to higher performance (up to 400% improvement). This is in comparison with other awswrangler operations such as `read_parquet` where the S3 object is downloaded and filtered on the client-side.\n",
    "\n",
    "This feature has a number of limitations however:\n",
    "\n",
    "- The maximum length of a record in the input or result is 1 MB\n",
    "- The maximum uncompressed row group size is 256 MB (Parquet only)\n",
    "- It can only emit nested data in JSON format\n",
    "- Certain SQL operations are not supported (e.g. ORDER BY)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Read multiple Parquet files from an S3 prefix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>vendor_id</th>\n      <th>pickup_at</th>\n      <th>dropoff_at</th>\n      <th>passenger_count</th>\n      <th>trip_distance</th>\n      <th>rate_code_id</th>\n      <th>store_and_fwd_flag</th>\n      <th>pickup_location_id</th>\n      <th>dropoff_location_id</th>\n      <th>payment_type</th>\n      <th>fare_amount</th>\n      <th>extra</th>\n      <th>mta_tax</th>\n      <th>tip_amount</th>\n      <th>tolls_amount</th>\n      <th>improvement_surcharge</th>\n      <th>total_amount</th>\n      <th>congestion_surcharge</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>2</td>\n      <td>2019-01-01T00:48:10.000Z</td>\n      <td>2019-01-01T01:36:58.000Z</td>\n      <td>1</td>\n      <td>31.570000</td>\n      <td>1</td>\n      <td>N</td>\n      <td>138</td>\n      <td>138</td>\n      <td>2</td>\n      <td>82.5</td>\n      <td>0.5</td>\n      <td>0.5</td>\n      <td>0.00</td>\n      <td>0.00</td>\n      <td>0.3</td>\n      <td>83.800003</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2</td>\n      <td>2019-01-01T00:38:36.000Z</td>\n      <td>2019-01-01T01:21:33.000Z</td>\n      <td>2</td>\n      <td>33.189999</td>\n      <td>5</td>\n      <td>N</td>\n      <td>107</td>\n      <td>265</td>\n      <td>1</td>\n      <td>121.0</td>\n      <td>0.0</td>\n      <td>0.0</td>\n      <td>0.08</td>\n      <td>10.50</td>\n      <td>0.3</td>\n      <td>131.880005</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>2</td>\n      <td>2019-01-01T00:10:43.000Z</td>\n      <td>2019-01-01T01:23:59.000Z</td>\n      <td>1</td>\n      <td>33.060001</td>\n      <td>1</td>\n      <td>N</td>\n      <td>243</td>\n      <td>42</td>\n      <td>2</td>\n      <td>92.0</td>\n      <td>0.5</td>\n      <td>0.5</td>\n      <td>0.00</td>\n      <td>5.76</td>\n      <td>0.3</td>\n      <td>99.059998</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>1</td>\n      <td>2019-01-01T00:13:17.000Z</td>\n      <td>2019-01-01T01:06:13.000Z</td>\n      <td>1</td>\n      <td>44.099998</td>\n      <td>5</td>\n      <td>N</td>\n      <td>132</td>\n      <td>265</td>\n      <td>2</td>\n      <td>150.0</td>\n      <td>0.0</td>\n      <td>0.0</td>\n      <td>0.00</td>\n      <td>0.00</td>\n      <td>0.3</td>\n      <td>150.300003</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>2</td>\n      <td>2019-01-01T00:29:11.000Z</td>\n      <td>2019-01-01T01:29:05.000Z</td>\n      <td>2</td>\n      <td>31.100000</td>\n      <td>1</td>\n      <td>N</td>\n      <td>169</td>\n      <td>201</td>\n      <td>1</td>\n      <td>85.5</td>\n      <td>0.5</td>\n      <td>0.5</td>\n      <td>0.00</td>\n      <td>7.92</td>\n      <td>0.3</td>\n      <td>94.720001</td>\n      <td>NaN</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
      "text/plain": "  vendor_id                 pickup_at                dropoff_at  \\\n0         2  2019-01-01T00:48:10.000Z  2019-01-01T01:36:58.000Z   \n1         2  2019-01-01T00:38:36.000Z  2019-01-01T01:21:33.000Z   \n2         2  2019-01-01T00:10:43.000Z  2019-01-01T01:23:59.000Z   \n3         1  2019-01-01T00:13:17.000Z  2019-01-01T01:06:13.000Z   \n4         2  2019-01-01T00:29:11.000Z  2019-01-01T01:29:05.000Z   \n\n   passenger_count  trip_distance rate_code_id store_and_fwd_flag  \\\n0                1      31.570000            1                  N   \n1                2      33.189999            5                  N   \n2                1      33.060001            1                  N   \n3                1      44.099998            5                  N   \n4                2      31.100000            1                  N   \n\n   pickup_location_id  dropoff_location_id payment_type  fare_amount  extra  \\\n0                 138                  138            2         82.5    0.5   \n1                 107                  265            1        121.0    0.0   \n2                 243                   42            2         92.0    0.5   \n3                 132                  265            2        150.0    0.0   \n4                 169                  201            1         85.5    0.5   \n\n   mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount  \\\n0      0.5        0.00          0.00                    0.3     83.800003   \n1      0.0        0.08         10.50                    0.3    131.880005   \n2      0.5        0.00          5.76                    0.3     99.059998   \n3      0.0        0.00          0.00                    0.3    150.300003   \n4      0.5        0.00          7.92                    0.3     94.720001   \n\n   congestion_surcharge  \n0                   NaN  \n1                   NaN  \n2                   NaN  \n3                   NaN  \n4                   NaN  "
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import awswrangler as wr\n",
    "\n",
    "df = wr.s3.select_query(\n",
    "    sql='SELECT * FROM s3object s where s.\"trip_distance\" > 30',\n",
    "    path=\"s3://ursa-labs-taxi-data/2019/01/\",\n",
    "    input_serialization=\"Parquet\",\n",
    "    input_serialization_params={},\n",
    ")\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Read full CSV file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>question</th>\n",
       "      <th>product_description</th>\n",
       "      <th>image_url</th>\n",
       "      <th>label</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Will the volca sample get me a girlfriend?</td>\n",
       "      <td>Korg Amplifier Part VOLCASAMPLE</td>\n",
       "      <td>http://ecx.images-amazon.com/images/I/81I1XZea...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Can u communicate with spirits even on Saturday?</td>\n",
       "      <td>Winning Moves Games Classic Ouija</td>\n",
       "      <td>http://ecx.images-amazon.com/images/I/81kcYEG5...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>I won't get hunted right?</td>\n",
       "      <td>Winning Moves Games Classic Ouija</td>\n",
       "      <td>http://ecx.images-amazon.com/images/I/81kcYEG5...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>I have a few questions.. Can you get possessed...</td>\n",
       "      <td>Winning Moves Games Classic Ouija</td>\n",
       "      <td>http://ecx.images-amazon.com/images/I/81kcYEG5...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Has anyone asked where the treasure is? What w...</td>\n",
       "      <td>Winning Moves Games Classic Ouija</td>\n",
       "      <td>http://ecx.images-amazon.com/images/I/81kcYEG5...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                            question  \\\n",
       "0         Will the volca sample get me a girlfriend?   \n",
       "1   Can u communicate with spirits even on Saturday?   \n",
       "2                          I won't get hunted right?   \n",
       "3  I have a few questions.. Can you get possessed...   \n",
       "4  Has anyone asked where the treasure is? What w...   \n",
       "\n",
       "                 product_description  \\\n",
       "0    Korg Amplifier Part VOLCASAMPLE   \n",
       "1  Winning Moves Games Classic Ouija   \n",
       "2  Winning Moves Games Classic Ouija   \n",
       "3  Winning Moves Games Classic Ouija   \n",
       "4  Winning Moves Games Classic Ouija   \n",
       "\n",
       "                                           image_url label  \n",
       "0  http://ecx.images-amazon.com/images/I/81I1XZea...     1  \n",
       "1  http://ecx.images-amazon.com/images/I/81kcYEG5...     1  \n",
       "2  http://ecx.images-amazon.com/images/I/81kcYEG5...     1  \n",
       "3  http://ecx.images-amazon.com/images/I/81kcYEG5...     1  \n",
       "4  http://ecx.images-amazon.com/images/I/81kcYEG5...     1  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = wr.s3.select_query(\n",
    "    sql=\"SELECT * FROM s3object\",\n",
    "    path=\"s3://humor-detection-pds/Humorous.csv\",\n",
    "    input_serialization=\"CSV\",\n",
    "    input_serialization_params={\n",
    "        \"FileHeaderInfo\": \"Use\",\n",
    "        \"RecordDelimiter\": \"\\r\\n\",\n",
    "    },\n",
    "    scan_range_chunk_size=1024 * 1024 * 32,  # override range of bytes to query, by default 1Mb\n",
    "    use_threads=True,\n",
    ")\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Filter JSON file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>family_name</th>\n",
       "      <th>contact_details</th>\n",
       "      <th>name</th>\n",
       "      <th>links</th>\n",
       "      <th>gender</th>\n",
       "      <th>image</th>\n",
       "      <th>identifiers</th>\n",
       "      <th>other_names</th>\n",
       "      <th>sort_name</th>\n",
       "      <th>images</th>\n",
       "      <th>given_name</th>\n",
       "      <th>birth_date</th>\n",
       "      <th>id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Biden</td>\n",
       "      <td>[{'type': 'twitter', 'value': 'joebiden'}]</td>\n",
       "      <td>Joseph Biden, Jr.</td>\n",
       "      <td>[{'note': 'Wikipedia (ace)', 'url': 'https://a...</td>\n",
       "      <td>male</td>\n",
       "      <td>https://theunitedstates.io/images/congress/ori...</td>\n",
       "      <td>[{'identifier': 'B000444', 'scheme': 'bioguide...</td>\n",
       "      <td>[{'lang': None, 'name': 'Joe Biden', 'note': '...</td>\n",
       "      <td>Biden, Joseph</td>\n",
       "      <td>[{'url': 'https://theunitedstates.io/images/co...</td>\n",
       "      <td>Joseph</td>\n",
       "      <td>1942-11-20</td>\n",
       "      <td>64239edf-8e06-4d2d-acc0-33d96bc79774</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  family_name                             contact_details               name  \\\n",
       "0       Biden  [{'type': 'twitter', 'value': 'joebiden'}]  Joseph Biden, Jr.   \n",
       "\n",
       "                                               links gender  \\\n",
       "0  [{'note': 'Wikipedia (ace)', 'url': 'https://a...   male   \n",
       "\n",
       "                                               image  \\\n",
       "0  https://theunitedstates.io/images/congress/ori...   \n",
       "\n",
       "                                         identifiers  \\\n",
       "0  [{'identifier': 'B000444', 'scheme': 'bioguide...   \n",
       "\n",
       "                                         other_names      sort_name  \\\n",
       "0  [{'lang': None, 'name': 'Joe Biden', 'note': '...  Biden, Joseph   \n",
       "\n",
       "                                              images given_name  birth_date  \\\n",
       "0  [{'url': 'https://theunitedstates.io/images/co...     Joseph  1942-11-20   \n",
       "\n",
       "                                     id  \n",
       "0  64239edf-8e06-4d2d-acc0-33d96bc79774  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wr.s3.select_query(\n",
    "    sql=\"SELECT * FROM s3object[*] s where s.\\\"family_name\\\" = 'Biden'\",\n",
    "    path=\"s3://awsglue-datasets/examples/us-legislators/all/persons.json\",\n",
    "    input_serialization=\"JSON\",\n",
    "    input_serialization_params={\n",
    "        \"Type\": \"Document\",\n",
    "    },\n",
    ")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.14",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}